Team: Surface

Ruozi Song (rs3581)

Shengyang Zhang (sz2624)

Haiqi Zhu (hz2432)

Note:

  1. All our source codes and data can be found here: https://github.com/RuoziSong/Surface

  2. The published Shiny App can be found here : https://surface-eda.shinyapps.io/Surface/

1. Introduction

On the restaurant searching website like Yelp, the information of restaurant health grade is always missing. Thus we want to come up with a program that focus on the health grade/socre of restaurants in the NYC. Moreover, instead of just plotting graphs or come up with a simple grade mark, we want to come up with an interactive project which can tell us about the health grade history of each restaurant, and multiple comparation of each restaurant to the same type/location restaurant. In addition, we also want our project be more mimic and more attractive tool to our user who want to analyse such problems.

Based on these concepts, we found a related data called DOHMH New York City Restaurant Inspection Results on the NYC OpenData with the size of 154.8MB and 424224 records (can be found at https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/xx67-kt59/data). This dataset perfectly matches our needs, since this dataset contains the location of the restaurant as well as the time stamps of each inspection. Note that this dataset is a relatively clean but has big amount of missing data.

The finished project is a map related program which studies the restaurant environmental condition and health grades/scores in the NYC. Furthermore, by clicking the restaurant on the map, we can see this grade/score, type of violation, type of restaurant, etc. as well as its history and other related visualized data.

2. Team

The team Surface has three members, Ruozi Song, Shengyang Zhang, and Haiqi Zhu.

The project work was equally distributed to three members. We split the work load mainly by the different topics we want to study.

Ruozi was mainly working on the data cleaning and coding python app to convert address in the data set to longitude and latitude to locate restaurants through google map. Besides data cleaning, her contribution is mainly focus on the second panel ‘Analysis - Inspection score by zip code’. She also finished the popup part of the first panel ‘Restaurant locator’.

Shengyang was mainly focused on the map part of the first panel “interactive map” including create the map and update given data as required on the map with relocation. He also finished the third panel “analysis”.

Haiqi contributed the plot and the majority of UI part of the first panel “interactive map”. He also come up with the idea of the cute icon and coded it himself. Haiqi also contributed a lot on collect code from the group and putting code together.

3. Analysis of Data Quality

As mentioned in part 1, the data set we are using is DOHMH New York City Restaurant Inspection Results on the NYC OpenData with the size of 154.8MB and 424224 records.

The data is relatively clean and well formed. Thus we do not need to write regular expression to clean it from the beginning. However, there is a lot of missing value in this data especially in column “GREAD”, “GRADE DATE”, and “SCORE”.

library(extracat)
## 
## This data.table install has not detected OpenMP support. It will work but slower in single threaded mode.
data <- read.csv("restaurant1.csv", na.strings = '')
visna(data,sort="c")

Due to the large missing of “GRADE” and “GRADE DATE”, we will focus our analysis on SCORE which GRADE was based on. Furthermore, during analysis, we also found some other things unclean in the data. We will explain them in detail in part 5.

4. Executive Summary

Note:

The professor said that for us using Shiny, we can describe just the main functions of our app in part 4 and then give well-organized analysis and results in part 5.

4.1 Individual restaurant information locator

In the restaurant locator page, input the zipcode, direction (borough) and the cuisine you want to eat, then it will show all restaurants meeting the requirement on the map. Click one restaurant on the map, it will show the details of this restaurant including specific location, cuisine type and phone number. Below the map, the historical inspection records of this particular restaurant will be present.

(citation: https://github.com/rstudio/shiny-examples/tree/master/063-superzip-example)

4.2 Inspection performance among different neighborhoods

In addition to the hygiene condition of individual restaurants, we are also concerned about how is the overall hygiene condition between different neighborhoods. Here, we draw a filled map over all zip codes area in New York by overall hygiene condition.

We use zip code to represent a neighborhood since the boundaries of zip codes were well defined while the boundaries of neighborhoods were not. To gain more information from the plot, we add three parameters, year, measure and violation type. For example, we want to see in year 2017, how is the overall hygiene condition in terms of violation type 03, which is problem of food source. For restaurants within each zip code, we calculate their mean/median inspection scores. Here we enable the options of mean or median to see whether there are restaurants with extreme scores to scale the result.

(citation: https://github.com/thisisnic/Shiny-London-Income-Map)

4.3 Other analysis

Besides the overall hygiene condition between different neighborhoods, we also want to know how is the overall hygiene condition among different types of restaurants, different types of violations and over time. We conduct this kind of analysis in panel 3.

5. Main Analysis

5.1 Individual restaurant information locator

Say someone wants to choose a restaurant to eat and cares about the hygiene situation of the restaurant.

In this page, user can input the name of restaurant and check the inspection results. For example, user wants to eat subway around Columbia. Input the subway and location, it shows three subway restaurants on the map.

Check these three subway, the one on 348 MALCOLM X BOULEVARD has food temperature, hygiene and container problems and the inspection score is very high which means it sanitation is not very good.

And the one on 578 W 125th ST, it only has food container problem and the score is very low on 2017. Therefore, according to the analysis result. The user will prefer this one.

If user does not know which specific restaurant to eat and input the cuisine type and location, It will present all restaurants on the map and user can click each of them to see the analysis of sanitation condition and then choose which one to eat. Generally, we show inspection history result and compare this restaurant with other same cuisine type restaurants in this location. User can choose the restaurant based on these information.

For example, when user wants to eat African food near Columbia, input the information and it shows five restaurants.

For clicks ACCORA.RESTAURANT, the line graph shows this restaurant inspection score history. The lower score is better. Its hygiene is good in 2015 but not very good recently. And during the inspection, this restaurant has food temperature, facility and hygiene issues. In bar chart, compared with other near African restaurants inspection results, ACCORA RESTAURANT has more serious issue on food temperature.

For clicks J.RESTAURANT, the line graph shows this restaurant inspection score history. The lower score is better. For J. RESTURANT, its hygiene is good in 2015. And during the inspection, this restaurant has food temperature, food source and hygiene issues. In bar chart, compared with other near African restaurants inspection results, J.RESTAURANT has more serious issues on food resource and temperature.

We use leaflet package to show the map. And use popup to show the restaurant details including location, zipcode, cuisine type and phone number. When user clicks one restaurant, it will show the analysis of inspection result. The difficult part is to implement the interaction between ui.R and server.R. When user choose one restaurant on the map, it will return the id of this restaurant and process data by this id in the server.R.

5.2 Inspection performance among different neighborhoods

To get a sense of the overview hygiene condition of all restaurants in New York City, we firstly think of a map to display where the restaurants’ hygiene condition is better and where is worse. Thus, we use a filled map to display the mean/median inspection scores for all neighborhoods by colors.

One challenge we met was to find neighborhood boundaries in New York City. We found out that the neighborhood boundaries were not well defined. For example, Google maps and the neighborhood boundaries on the real estate website Streeteasy gives different boundary for Harlem. Also, neighborhood area may be too large to get detail information of smaller areas. Then we decided to use zip codes as to divide areas. Zip codes boundaries were well defined and their size is reasonable to be an area to study.

Secondly, since we have data of multiple years, we would like to see whether the hygiene condition improved over the years and where they are. So we add the parameter year to enable us to see the yearly change. From the data, we have complete and decent number of inspections after year 2013, thus we didn’t use the data of year 2012.

In terms of the yearly change, the performance of restaurants on food temperature was quite stable over the years. Overall, the restaurants in NYC performed well on reasonable food temperature. However, when looking at the quality of food source, we can see that the map shows more dark areas in year 2017 than previous years, which means that the quality of food source in the first quarter of year 2017 got worse compared to previous years.

Also, different people may weigh the violation type differently. For example, I think the problem of food source is more serious than whether the food is too hot to be served. But one of our group member has different thoughts. So we add another parameter, violation type, to enable us to see the performance of restaurants under different type of inspections. Another interesting finding from this is to see which kind of sanitation issue is more serious for most restaurants.

When looking at the different violation codes, we found that there are 75 different kinds of violations. With closer look, we found these violations can be reclassified into smaller number of types, which is easier to understand. For example, all violations with ‘02’ in violation code are related to the problem with food source. So we made them into the class ‘Problem with food source’.

library(dplyr)
data_score <- data[which(!is.na(data$SCORE)),]
data_score <- data_score[which(!is.na(data_score$VIOLATION.CODE)),]
data_score$VIOLATION.TYPE <- sapply(data_score$VIOLATION.CODE, substring, 1, 2)
type <- data_score %>% select(VIOLATION.TYPE, VIOLATION.CODE,VIOLATION.DESCRIPTION)%>%
  arrange(VIOLATION.TYPE)
type <- unique(type)
type[1:5,]
##   VIOLATION.TYPE VIOLATION.CODE
## 1             02            02B
## 3             02            02E
## 4             02            02A
## 5             02            02G
## 6             02            02H
##                                                                                                                                                                                                                               VIOLATION.DESCRIPTION
## 1                                                                                                                                                                                            Hot food item not held at or above 140\x92\xe4\xce_ F.
## 3                                                                                                                                 Whole frozen poultry or poultry breasts, other than a single portion, is being cooked frozen or partially thawed.
## 4                                                                                                                                                                                                  Food not cooked to required minimum temperature.
## 5                                                                                          Cold food item held above 41\x92\xe4\xce_ F (smoked fish and reduced oxygen packaged foods above 38 \x92\xe4\xce_F) except during necessary preparation.
## 6 Food not cooled by an approved method whereby the internal product temperature is reduced from 140\x92\xe4\xce_ F to 70\x92\xe4\xce_ F or less within 2 hours, and from 70\x92\xe4\xce_ F to 41\x92\xe4\xce_ F or less within 4 additional hours.

Here, we find it interesting that among all different violation types, the variance of hygiene conditions about food source among all neighborhoods is the largest. For other violation type, the scores are relatively lower, which means the violation issues are not that serious. So when we are concerned about hygiene condition, we should pay attention to the quality of the food source.

5.3 Other analysis

The third panel is the analysis which is not related to map and location. We use traditional bar plot as the tool and we are focusing on restaurant type and violation type.

For the restaurant type we first find and order it by the number of inspection. This plot can tells us the distribution of restaurant in our data set. Note that the American and Chinese restaurants is the majority of our data set.

library(ggplot2)
ggplot(data[which(!is.na(data$CUISINE.DESCRIPTION)),], aes(x = factor(CUISINE.DESCRIPTION, levels = names(sort(table(CUISINE.DESCRIPTION), decreasing = TRUE)))))+ geom_bar(stat='count')+labs(x="Restaurant Type",y="number of inspection",title="Restaurant Type Distribution")+ theme(axis.text.x = element_text(angle = 45, hjust = 1))

The second thing and the most important thing of the restaurant type we are trying to find is the mean score of each restaurant type. From the plot below, we can see that Creole restaurants has the highest mean score of 22.63 which equals to B grade and the lowest mean score is 8.130435 which is the ‘Soups’ restaurant. Note that the lower the score is the cleaner enviroment the restaurant can provide.

ggplot(aggregate(SCORE~CUISINE.DESCRIPTION, data[which(!is.na(data$SCORE)),], mean),aes(x=factor(CUISINE.DESCRIPTION, levels = CUISINE.DESCRIPTION[order(SCORE)]),y=SCORE))+geom_bar(stat="identity")+ theme(axis.text.x = element_text(angle = 45, hjust = 1))+labs(x="Restaurant Type",y="Mean of Score",title="Mean Score of Restaurant Type")

The third thing we are looking for is what is the most common critical violation of each restaurant type. From the graph we can see that the 10F is the most common critical violation in most of the restaurant. Note that violation code 10F is “Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit”. We can tell that this is the common health problem in the restaurants of NYC.

most_vio_rt<-data[!is.na(data$VIOLATION.CODE),][data[, "CRITICAL.FLAG"] == "Critical",] %>% count(CUISINE.DESCRIPTION, VIOLATION.CODE) %>%slice(which.max(n))
ggplot(most_vio_rt, aes(x = factor(CUISINE.DESCRIPTION, levels = CUISINE.DESCRIPTION[order(-n)]),y=n))+ geom_bar(stat='identity') +labs(x="Restaurant Type",y="Number of Violation Type",title="The Most Common Critical Violation in Each Restaurant Type")+geom_text(aes(label=VIOLATION.CODE),size=1.5,vjust=-0.5)+ theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 1 rows containing missing values (geom_text).

Then we do the similar analysis on the violation type. We first find and order it by the number of inspection. This plot can tells us the distribution of violation type in our data set. As expected, 10F is the most violated problem. And the violated problems is most the top 9 problems.

ggplot(data[which(!is.na(data$VIOLATION.CODE)),], aes(x = factor(VIOLATION.CODE, levels = names(sort(table(VIOLATION.CODE), decreasing = TRUE)))))+ geom_bar(stat='count')+labs(x="Violation Type",y="number of inspection",title="Violation Type Distribution")+ theme(axis.text.x = element_text(angle = 90, hjust = 1))

The second thing of the violation type we are trying to find is the mean score of each violation type. This represents the harmfulness of each violation type. We can see that the most harmless violation type is 20F (Current letter grade card not posted) with score 7 and the most critical violation type is 06H (Records and logs not maintained to demonstrate that HACCP plan has been properly implemented) with score 60.56. Note that the most common violation type 10F has score 13.42 which is still in A range.

ggplot(aggregate(SCORE~VIOLATION.CODE, data[which(!is.na(data$SCORE)),], mean),aes(x=factor(VIOLATION.CODE, levels = VIOLATION.CODE[order(SCORE)]),y=SCORE))+geom_bar(stat="identity")+ theme(axis.text.x = element_text(angle = 45, hjust = 1))+labs(x="Violation Type",y="Mean of Score",title="Mean Score of Violation Type")

The last plot of the third panel is the distribution of inspection over time. Our data is from 2012-2017 and the majority of the data is form 2013-2017.

data$INSPECTION.DATE <- as.Date(data$INSPECTION.DATE, format="%m/%d/%y")
data <- data[which(data$INSPECTION.DATE != '2000-01-01'),]
data <- data[data[, "lat"] != 0,]
data <- data[data[, "ing"] != 0,]
data$YEAR <- as.numeric(format(data$INSPECTION.DATE, "%Y")) 
ggplot(data[which(!is.na(data$INSPECTION.DATE)),],aes(INSPECTION.DATE))+geom_histogram(binwidth = 30)+labs(x="Inspection Date",y="number of inspection",title="Number of Inspection over Time")

Also, notice that we want to create a map in our shiny application, thus we will need data of longitute and latitude of each restaurant which is not provided in this data set. Thus we wrote a python file to get such information through google map. The Python code is listed below.

#!/usr/bin/python
import csv
import requests
import time
import json

appid = "35348ee9"
appkey = "874bc0a8aaafe29bbe84abaeb78fd57d"
with open('restaurant.csv', 'rb') as readfile, open('restaurant1.csv', 'a') as writefile:
 writer = csv.writer(writefile, delimiter=',')
 reader = csv.reader(readfile, delimiter=',')
 last_address = ""
 last_coord = (0, 0)

 for row in reader:
  if row[0] == "CAMIS":
   writer.writerow(row + ['lat', 'ing'])
   continue

  address = row[3] + row[4] + row[2]

  if address == last_address:
   writer.writerow(row + [last_coord[0], last_coord[1]])
   continue

  last_address = address
   
  url = "https://api.cityofnewyork.us/geoclient/v1/address.json?houseNumber={0}&street={1}&borough={2}&app_id={3}&app_key={4}".format(row[3], row[4], row[2], appid, appkey)

  while True:
   try:
    r = requests.get(url).json()
    lat = r['address']['latitude']
    lon = r['address']['longitude']
    writer.writerow(row + [lat, lon]) 
    print (lat, lon)
    break
   except Exception as e:
    print e
    print address
    writer.writerow(row + ['0.0', '0.0'])
    break

#https://api.cityofnewyork.us/geoclient/v1/address.json?houseNumber=30&street=ROCKFEELLER PLAZA&borough=Manhattan&app_id=35348ee9&app_key=874bc0a8aaafe29bbe84abaeb78fd57d

6. Conclusion

Due to the limitation of time, the current shiny application is not perfect and has very limited functions. For example, the marker is not easy to identify and click for each restaurant due to the huge amount of markers on the map when search an area or zipcode. If we have more time/co-workers, we can only show the top 20 of the area or zip-code while having a “next 20” bottom to make the map more readable and clickable.

The lessons we learned is that it is very important to have uniform variable names. One team should definately have a document to store such information online. Moreover, one shouold never change the original data set unless everyone changes it. We had a lot of problem while putting code together since everyone reform the not well formed data (normally date and time) in different ways.